Please locate your vsp_bigdata folder under “My Documents” and navigate to group-session. Create an 08-lecture folder under the group-session folder.
For this group session, we will use the Gapminder database.
Please download this CSV file and save it under the group session folder: Gapminder data 2016
Now, you need to download the gapminder geographic data (coordinates): Gapminder geographic coordinates
The purpose of this group session is to get you familiar with visualization and web mapping. We will go through three parts.
First, we will join the gapminder data with the geographic coordinates.
Second, we will create a few plots showing the relationship between income and life expectancy.
Third, we will create a map showing each of the variables. Then we will export the map as an interactive web map that can be shared with other people.
First, we are going to join the gapminder data with the geographic coordinates data. Let’s load the libraries first.
# Install packages
# install.packages("leaflet")
# install.packages("dplyr")
# install.packages("magrittr")
# install.packages("ggplot2")
# install.packages("plotly")
# Load packages
library(leaflet)
library(dplyr)
## Warning: package 'dplyr' was built under R version 3.5.1
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(magrittr)
library(ggplot2)
library(plotly)
##
## Attaching package: 'plotly'
## The following object is masked from 'package:ggplot2':
##
## last_plot
## The following object is masked from 'package:stats':
##
## filter
## The following object is masked from 'package:graphics':
##
## layout
Let’s load the data.
# Paths
# gapminder = read.csv(file.choose())
# geo = read.csv(file.choose())
gapminder = read.csv("/Users/andyhong/Documents/7-Teaching/Urban Big Data Analytics/03-GroupSessions/08-group-data_visualization/gapminder_data_2016.csv")
geo = read.csv("/Users/andyhong/Documents/7-Teaching/Urban Big Data Analytics/03-GroupSessions/08-group-data_visualization/gapminder_geo.csv")
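The absolute paths above only work on the instructor's machine. A minimal sketch of a more portable approach follows; the helper name `read_session_csv` and the `data_dir` argument are hypothetical and not part of the original session, so point `data_dir` at wherever you saved the two CSV files.

```r
# Hypothetical helper: build the path from a folder and a file name,
# and fail with a readable message if the file is not there.
read_session_csv = function(data_dir, file_name) {
  path = file.path(data_dir, file_name)
  if (!file.exists(path)) {
    stop("File not found: ", path)
  }
  read.csv(path)
}

# Example usage (replace data_dir with your own folder):
# data_dir = "path/to/your/group-session/folder"
# gapminder = read_session_csv(data_dir, "gapminder_data_2016.csv")
# geo = read_session_csv(data_dir, "gapminder_geo.csv")
```

Using `file.path()` avoids hard-coding the path separator, so the same code runs on Windows and macOS.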
Let’s examine the data with the head() function.
head(gapminder)
## name region income lifeExp
## 1 Afghanistan asia 1740 58.0
## 2 Albania europe 11400 77.7
## 3 Algeria africa 14000 77.4
## 4 Andorra europe 48200 82.5
## 5 Angola africa 6030 64.7
## 6 Antigua and Barbuda americas 20800 77.3
head(geo)
## name lat long population
## 1 Afghanistan 33.00000 66.00000 34700000
## 2 Albania 41.00000 20.00000 2930000
## 3 Algeria 28.00000 3.00000 40600000
## 4 Andorra 42.50779 1.52109 77300
## 5 Angola -12.50000 18.50000 28800000
## 6 Antigua and Barbuda 17.05000 -61.80000 101000
We can also view the data in a more familiar tabular format.
# View(gapminder)
# View(geo)
Another way to see the “structure” of the dataset is to run the str() function.
str(gapminder)
## 'data.frame': 187 obs. of 4 variables:
## $ name : Factor w/ 187 levels "Afghanistan",..: 1 2 3 4 5 6 7 8 9 10 ...
## $ region : Factor w/ 4 levels "africa","americas",..: 3 4 1 4 1 2 2 4 3 4 ...
## $ income : int 1740 11400 14000 48200 6030 20800 18500 8170 44400 44100 ...
## $ lifeExp: num 58 77.7 77.4 82.5 64.7 77.3 76.7 75.7 82.5 81.5 ...
str(geo)
## 'data.frame': 187 obs. of 4 variables:
## $ name : Factor w/ 187 levels "Afghanistan",..: 1 2 3 4 5 6 7 8 9 10 ...
## $ lat : num 33 41 28 42.5 -12.5 ...
## $ long : num 66 20 3 1.52 18.5 ...
## $ population: int 34700000 2930000 40600000 77300 28800000 101000 43800000 2920000 24100000 8710000 ...
You will notice that the first two columns of gapminder, “name” and “region”, are both “Factor” type variables. This means that they are text, or more precisely, categorical variables. “income” is an “int” (integer) variable and “lifeExp” is a “num” (numeric) variable.
The geo dataset contains name as well as lat, long, and population columns. The “lat” and “long” columns are “num” (numeric) variables, and population is an “int” (integer) variable.
The built-in summary() function in base R gives simple summary statistics for every variable in a dataset. Since this dataset only has 4 variables, we can simply call summary(gapminder), which will give us the summary statistics for all 4 variables.
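The “Factor” types shown above come from R 3.5.1. From R 4.0.0 onward, read.csv() no longer converts text columns to factors by default, so you may see “chr” instead. The sketch below uses three rows copied from the head() output (read from an inline string rather than the real file) to show both behaviours.

```r
# Three rows copied from the head(gapminder) output above
csv_text = "name,region,income,lifeExp
Afghanistan,asia,1740,58.0
Albania,europe,11400,77.7
Algeria,africa,14000,77.4"

# Text columns kept as character (the default in R >= 4.0.0)
as_character = read.csv(text = csv_text, stringsAsFactors = FALSE)

# Text columns converted to factors (the default in R < 4.0.0)
as_factor = read.csv(text = csv_text, stringsAsFactors = TRUE)

class(as_character$name)
class(as_factor$name)
levels(as_factor$region)
```

Either type works for the join and the plots in this session; the difference mainly affects what str() and summary() print.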
summary(gapminder)
## name region income lifeExp
## Afghanistan : 1 africa :54 Min. : 625 Min. :50.30
## Albania : 1 americas:34 1st Qu.: 3325 1st Qu.:66.65
## Algeria : 1 asia :54 Median : 10800 Median :73.50
## Andorra : 1 europe :45 Mean : 17351 Mean :72.21
## Angola : 1 3rd Qu.: 23850 3rd Qu.:77.65
## Antigua and Barbuda: 1 Max. :118000 Max. :83.90
## (Other) :181
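summary() describes the whole table at once. If you want the same kind of statistics per continent, one possible approach is group_by() followed by summarise() from dplyr. The sketch below uses only the six rows printed by head(gapminder), so the numbers are not the real regional statistics; the column names `n`, `median_income`, and `mean_lifeExp` are chosen here for illustration.

```r
library(dplyr)

# Six rows copied from the head(gapminder) output
toy = data.frame(
  name = c("Afghanistan", "Albania", "Algeria", "Andorra", "Angola", "Antigua and Barbuda"),
  region = c("asia", "europe", "africa", "europe", "africa", "americas"),
  income = c(1740, 11400, 14000, 48200, 6030, 20800),
  lifeExp = c(58.0, 77.7, 77.4, 82.5, 64.7, 77.3),
  stringsAsFactors = FALSE
)

# One row per region: number of countries, median income, mean life expectancy
toy_summary = toy %>%
  group_by(region) %>%
  summarise(n = n(),
            median_income = median(income),
            mean_lifeExp = mean(lifeExp))

toy_summary
```

Replacing `toy` with the full gapminder data frame should give one row for each of the four regions.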
It looks like the column “name” is common to the two data sets, so we can use it as the join key. Now, let’s join the data together to prepare for mapping later. We are going to use inner_join so that we only keep countries that appear in both data sets, that is, countries with geographic coordinates.
# Join gapminder data and the geographic coordinates
gapminder = gapminder %>% inner_join(geo, by="name")
# Check the joined data
head(gapminder)
## name region income lifeExp lat long
## 1 Afghanistan asia 1740 58.0 33.00000 66.00000
## 2 Albania europe 11400 77.7 41.00000 20.00000
## 3 Algeria africa 14000 77.4 28.00000 3.00000
## 4 Andorra europe 48200 82.5 42.50779 1.52109
## 5 Angola africa 6030 64.7 -12.50000 18.50000
## 6 Antigua and Barbuda americas 20800 77.3 17.05000 -61.80000
## population
## 1 34700000
## 2 2930000
## 3 40600000
## 4 77300
## 5 28800000
## 6 101000
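Because inner_join silently drops rows without a match, it can be worth checking which countries would be lost. A minimal sketch using anti_join from dplyr is shown below. The toy data frames are made up for illustration, and “Hypothetical Country” is an invented name used only to force a non-match.

```r
library(dplyr)

# Toy attribute table; the last row has no counterpart in toy_geo
toy_gap = data.frame(
  name = c("Afghanistan", "Albania", "Hypothetical Country"),
  income = c(1740, 11400, NA),
  stringsAsFactors = FALSE
)

# Toy coordinate table (values copied from the head(geo) output)
toy_geo = data.frame(
  name = c("Afghanistan", "Albania"),
  lat = c(33, 41),
  long = c(66, 20),
  stringsAsFactors = FALSE
)

# Rows of toy_gap that have no match in toy_geo
unmatched = toy_gap %>% anti_join(toy_geo, by = "name")
unmatched

# The inner join keeps only the matching rows
joined = toy_gap %>% inner_join(toy_geo, by = "name")
nrow(joined)
```

Running the same anti_join on the real gapminder and geo data frames (before overwriting gapminder with the joined result) would list any country names that are spelled differently in the two files.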
Now, let’s create some plots for exploratory data analysis. We will first create a boxplot showing life expectancy grouped by continent (the region column).
ggplot(gapminder, aes(x = region, y = lifeExp)) +
geom_boxplot(outlier.colour = "hotpink") +
geom_jitter(position = position_jitter(width = 0.1, height = 0), alpha = 1/4)
Next, let’s explore the relationship between income and life expectancy. What relationship do we expect to see?
ggplot(gapminder, aes(x = income, y = lifeExp)) +
geom_point()
We can be a little fancy by adding a smooth trend line.
ggplot(gapminder, aes(x = income, y = lifeExp)) +
geom_point() +
geom_smooth()
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'
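Income spans a very wide range, so most points bunch up on the left of the plot. One optional variation, not part of the original session, is to put the x axis on a log scale with scale_x_log10(). The sketch below uses only the six rows from the head() output, so it illustrates the syntax rather than the real pattern.

```r
library(ggplot2)

# Six income / life expectancy pairs copied from the head(gapminder) output
toy = data.frame(
  income = c(1740, 11400, 14000, 48200, 6030, 20800),
  lifeExp = c(58.0, 77.7, 77.4, 82.5, 64.7, 77.3)
)

# Same scatter plot as before, but with income on a log10 axis
p_log = ggplot(toy, aes(x = income, y = lifeExp)) +
  geom_point() +
  scale_x_log10()

p_log
```

Whether the log scale is the better choice for the full data set is a judgment call; it tends to spread out the low-income countries and make the trend easier to read.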
We can also color the points by continent.
ggplot(gapminder, aes(x = income, y = lifeExp)) +
geom_point(aes(color = region)) +
geom_smooth()
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'
We can color the trend lines as well.
ggplot(gapminder, aes(x = income, y = lifeExp)) +
geom_point(aes(color = region)) +
geom_smooth(aes(color = region))
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'
Or we can split the data and show each continent in its own panel.
ggplot(gapminder, aes(x = income, y = lifeExp)) +
geom_point(aes(color = region)) +
geom_smooth() +
facet_grid(.~region)
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'
The x-axis labels are hard to read. Let’s rotate the text.
ggplot(gapminder, aes(x = income, y = lifeExp)) +
geom_point(aes(color = region)) +
geom_smooth() +
facet_grid(.~region) +
theme(axis.text.x = element_text(angle = 45, hjust = 1))
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'
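Lastly, we can also make the plot interactive, so that we can see which dot represents which country. Note that we added a text aesthetic in geom_point to include country names. ggplot2 warns that it does not recognize this aesthetic, but ggplotly() uses it for the hover tooltip.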
p = ggplot(gapminder, aes(x = income, y = lifeExp)) +
geom_point(aes(text = paste("Country:", name), color = region)) +
geom_smooth() +
facet_grid(.~region) +
theme(axis.text.x = element_text(angle = 45, hjust = 1))
## Warning: Ignoring unknown aesthetics: text
ggplotly(p)
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'
In part three, we will create an interactive web map using the gapminder data. Earlier we joined the coordinate information to the gapminder dataset, so we can use the coordinates to show countries on the map. We will use the powerful leaflet package to accomplish this task. First, we will just plot the points on the map, and then use the addCircleMarkers function to visualize variables on the map.
Let’s initiate leaflet and add the empty map tiles. We can use different map tiles available here: http://leaflet-extras.github.io/leaflet-providers/preview/
leaflet(gapminder) %>% addTiles()
# leaflet(gapminder) %>% addProviderTiles(provider = "Stamen.TonerLite")
# leaflet(gapminder) %>% addProviderTiles(provider = "Stamen.Toner")
# leaflet(gapminder) %>% addProviderTiles(provider = "Esri.WorldImagery")
Now, let’s add the latitude and longitude points on the map. Note that we use the squiggly ~ sign to use the column names without the data name.
leaflet(gapminder) %>% addTiles() %>% addCircleMarkers(~long, ~lat)
Let’s use the variable income to visualize income levels on the map.
leaflet(gapminder) %>% addTiles() %>% addCircleMarkers(~long, ~lat, radius=~income)
What do you see on the screen? Why is it all blue?
The circles are so large that they cover the whole map. The radius is given in pixels, and the income data ranges from $625 to $118,000.
gapminder %>% summarise(min=min(income), max=max(income))
## min max
## 1 625 118000
We need to scale the data to visualize it on the map. We will first divide the income by 1000 and then take the square root, which compresses the range so that the largest circles do not swamp the smallest ones.
leaflet(gapminder) %>% addTiles() %>% addCircleMarkers(~long, ~lat, radius=~sqrt(income/1000))
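To see why this scaling helps, we can compute the radius for the smallest and largest incomes reported by summarise() above. This is a small arithmetic check, not a new step in the workflow.

```r
# Minimum and maximum income from the summarise() output
income_range = c(min = 625, max = 118000)

# Radius in pixels after dividing by 1000 and taking the square root
scaled_radius = sqrt(income_range / 1000)
round(scaled_radius, 2)

# Ratio of largest to smallest circle, before and after scaling
raw_ratio = income_range[["max"]] / income_range[["min"]]
scaled_ratio = scaled_radius[["max"]] / scaled_radius[["min"]]
round(c(raw = raw_ratio, scaled = scaled_ratio), 1)
```

Without scaling, the largest radius is well over a hundred times the smallest; after scaling, the radii stay within a range that fits on the screen.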
Congratulations! You created your first interactive map.
Now it’s a lot better, but we don’t know which country is which, and the points don’t show anything if we hover over them. Let’s label the points with the country name.
leaflet(gapminder) %>% addTiles() %>% addCircleMarkers(~long, ~lat, radius=~sqrt(income/1000), label=~name)
We can see the country name when we move the mouse over each circle. To make the map more informative, we can create a new column that holds our variable of interest on an appropriate scale, and another column that holds a descriptive label.
Note that we can create multiple columns in a single call to the mutate function, and that later columns can use earlier ones.
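# variable: income in thousands; label: hover text built from name and variable
# paste0() avoids the extra spaces that paste() inserts between its arguments
gapminder = gapminder %>% mutate(variable = income/1000,
label = paste0(name, " - Income: ", variable, "k"))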
Now, let’s see the final map with an appropriate scale and label.
leaflet(gapminder) %>% addTiles() %>% addCircleMarkers(~long, ~lat, radius=~sqrt(variable), label=~label, weight=2)
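Before mapping, it can help to preview what the label strings look like. The sketch below rebuilds the two columns for three rows taken from the head() output. Rounding to one decimal and using paste0() are choices made here for readability, not requirements.

```r
library(dplyr)

# Three rows copied from the head(gapminder) output
toy = data.frame(
  name = c("Afghanistan", "Albania", "Algeria"),
  income = c(1740, 11400, 14000),
  stringsAsFactors = FALSE
)

# variable is created first and then reused inside label in the same mutate() call
toy = toy %>%
  mutate(variable = round(income / 1000, 1),
         label = paste0(name, " - Income: ", variable, "k"))

toy$label
```

Checking the strings this way is quicker than hovering over circles on the map to find a formatting mistake.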
For fun, we can color each circle according to its continent and look for any spatial patterns.
pal = colorFactor(rainbow(4), gapminder$region)
leaflet(gapminder) %>%
addTiles() %>%
addCircleMarkers(~long, ~lat, radius=~sqrt(variable), label=~label, weight=2, color=~pal(region))
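Two finishing touches are sketched below: a legend that explains the colors, and exporting the map as an HTML file that can be shared. The sketch assumes the htmlwidgets package is installed (it is normally installed as a dependency of leaflet). The toy data are the six rows from the head() output, and the output file name `gapminder_map.html` and the temporary folder are placeholders you would replace.

```r
library(leaflet)
library(dplyr)

# Six joined rows copied from the head(gapminder) output
toy = data.frame(
  name = c("Afghanistan", "Albania", "Algeria", "Andorra", "Angola", "Antigua and Barbuda"),
  region = c("asia", "europe", "africa", "europe", "africa", "americas"),
  income = c(1740, 11400, 14000, 48200, 6030, 20800),
  lat = c(33, 41, 28, 42.50779, -12.5, 17.05),
  long = c(66, 20, 3, 1.52109, 18.5, -61.8),
  stringsAsFactors = FALSE
) %>%
  mutate(variable = income / 1000,
         label = paste0(name, " - Income: ", variable, "k"))

# One color per region
pal = colorFactor(rainbow(4), toy$region)

# Same map as above, plus a legend built from the same palette
m = leaflet(toy) %>%
  addTiles() %>%
  addCircleMarkers(~long, ~lat, radius = ~sqrt(variable), label = ~label,
                   weight = 2, color = ~pal(region)) %>%
  addLegend(position = "bottomright", pal = pal, values = ~region, title = "Region")

# Export to HTML; selfcontained = FALSE writes the HTML file plus a folder of
# supporting files and does not require pandoc
out_file = file.path(tempdir(), "gapminder_map.html")
htmlwidgets::saveWidget(m, file = out_file, selfcontained = FALSE)
```

With selfcontained = FALSE you need to share the HTML file together with its accompanying folder; setting it to TRUE produces a single file but requires pandoc to be available.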